Financial institutions offer a variety of services, including loan approval and disbursement, which provide customers with liquid capital for major purchases. Major purchases can include anything from real estate to business PP&E to vehicles. The loan approval process is one that many customers worry about, as the requested loan may or may not be approved (in part or in full) by the institution. Banks use customers' previous and current financial data, as well as human judgment (in the form of loan officer recommendations), to decide whether or not to approve a loan for a customer. Financial data crucial to these decisions includes the amount and necessity of the loan, the customer's credit history, and the customer's current financial situation. According to Investopedia's description of loan grading, bigger banks tend to use more quantitative approaches to approximate credit risk, while smaller banks tend to use a more holistic, human-judgment-based system.
This dataset, retrieved from Kaggle, consists of vehicle loan data from Larsen & Toubro, a global conglomerate with a financial services arm, founded in Mumbai, India. The task is to predict whether or not a customer will default on the loan given demographic, current loan, and past financial performance data - a binary classification task. To do so, we need to perform exploratory data analysis via data visualization in order to discover which features are most important to the determination of potential loan default (loan quality and credit risk).
The dataset includes many features, which can be categorized into the following sections:
One of the goals of exploring this dataset is to see what demographic, current loan, and past financial factors play a role in determining probability of loan default.
It is important to be able to predict potential loan default because financial institutions want to not only manage their lending capacity effectively, but also reduce occurrences of two costly errors: approving a loan for a customer who later defaults (a direct credit loss), and denying a loan to a creditworthy customer who would have repaid it (forgone interest revenue).
Both of these errors should be avoided in order for banks to remain financially successful in their loan segments.
Once models are built to accomplish this classification task, the prediction algorithm would need to perform better than the current data- and human-judgment-based approval process, which, according to this dataset, is around 79% correct in approving loans for people who will not default on them. Of course, the higher the accuracy that can be achieved, the better.
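To make the ~79% baseline concrete: a process that approves every applicant is "correct" on exactly the share of customers who do not default. A minimal arithmetic sketch (the 21.7% figure is the approximate default rate computed later in this notebook):

```python
# Approve-everyone baseline: accuracy equals the non-default rate.
default_rate = 0.217               # approximate share of loans that default
baseline_accuracy = 1 - default_rate
print(round(baseline_accuracy * 100, 1))  # -> 78.3
```

Any model we build should therefore beat this naive baseline, not just achieve "high" accuracy in the abstract.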
Important to note is that there are a few ethical implications of having an algorithm or model decide loan quality, as explained by this article by White & Case LLP ("Algorithms and bias: What lenders need to know"). An algorithmic credit-scoring system helps financial institutions "optimize default and prepayment rates" and "streamlines the [loan] application process." This machine learning-driven approach seeks to remedy the flaws of a human judgment-based system, in which bankers making the approval decisions may have preconceived and subjective biases. However, algorithmic systems are not immune from bias - in fact, they may "inadvertently generate biased conclusions that discriminate against protected classes of people," resulting in certain classes of people automatically being restricted from loan approval even if they are in other ways good candidates for loan approval. Since an algorithmic approach to credit approval is often more of a black box, financial institutions should be wary of any unintended and discriminatory biases that the model might learn from the data, as this may lead to fair lending claims and reputational damage.
An example of a bias in an algorithmic system is the following, taken from the White & Case LLP article linked above:
Consider an algorithm programmed to examine and incorporate certain shopping patterns into its decision model. It may reject all loan applicants who shop primarily at a particular chain of grocery stores because an algorithm "learned" that shopping at those stores is correlated with a higher risk of default. But if those stores are disproportionately located in minority communities, the algorithm could have an adverse effect on minority applicants who are otherwise creditworthy.
Financial services firms must plan to test for and remove any bias that shows up in their algorithmic systems in order to avoid legal and reputational risk.
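One simple form such a test could take is the "four-fifths rule" check on outcome rates across groups. The sketch below is purely illustrative - the column names and data are made up, not from this dataset:

```python
import pandas as pd

# Hypothetical toy data: loan approval outcomes for two demographic groups.
toy = pd.DataFrame({
    'group':    ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'approved': [ 1,   1,   1,   0,   1,   0,   0,   0 ],
})
rates = toy.groupby('group')['approved'].mean()
# Ratio of the lowest group's approval rate to the highest group's rate;
# a value below ~0.8 is a common red flag for adverse impact.
disparate_impact = rates.min() / rates.max()
print(round(disparate_impact, 2))  # -> 0.33
```

A real fairness audit would go much further (confidence intervals, proxy-variable analysis), but even a check this simple can surface the grocery-store-style proxy bias described above.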
import pandas as pd
import numpy as np
# load the dataset and view first few rows of training dataset
df = pd.read_csv('train.csv')
df.head()
# set all column names to lowercase and replace . with _
df = df.rename(str.lower, axis='columns')
df.columns = df.columns.str.replace('.', '_', regex=False)
df.info()
There are 233,154 auto loan customers in this dataset. The information above shows that there is almost no missing data - there are 233,154 non-null values for every column except the employment_type field. Fields that start with pri pertain to primary accounts, loans the customer has taken out for personal use. Fields that start with sec pertain to secondary accounts, loans for which the customer has acted as a co-applicant or guarantor.
Many of the features are of the correct data type, but a few require additional preprocessing (conducted in the next section) in order to be usable. These include all of the date fields, which are currently strings and should be integer types (in terms of # of days / months / years since the date). In addition, customer age and credit score values can be segmented into different categories to produce categorical features.
assert len(df) == len(df.uniqueid.unique())
The cell above ensures that all customers (rows) are unique by checking that the number of rows matches the number of unique IDs in the uniqueid field.
print("Percentage of vehicle loans defaulted on: ", sum(df.loan_default == 1) / len(df) * 100)
One thing to look out for when utilizing machine learning models in classification tasks is whether or not the dataset is imbalanced - if the number of instances of one class is very small or large in comparison to other classes. In the case of our binary classification task, about 22% of vehicle loans in the dataset resulted in a loan default. This shows that the dataset is not too drastically imbalanced.
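If the imbalance were more severe, one common remedy is inverse-frequency class weighting. As a hedged sketch (toy labels mirroring the ~22% default rate, not a step we perform on this dataset), scikit-learn can derive such weights directly:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 78 non-defaults, 22 defaults (roughly this dataset's ratio).
y = np.array([0] * 78 + [1] * 22)
# 'balanced' weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
print(weights)  # the minority (default) class receives the larger weight
```

These weights can be passed to most sklearn classifiers via their `class_weight` parameter, so minority-class errors count proportionally more during training.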
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
import missingno as mn
mn.matrix(df)
plt.show()
# get percentage of null rows
null_rows_df = df[df.employment_type.isnull()]
print("Number of rows with missing data: ", len(null_rows_df))
print("Percentage of rows with missing data: ", len(null_rows_df) / len(df) * 100)
print("Percentage of null rows where loans were defaulted on: ",
len(null_rows_df[null_rows_df.loan_default == 1]) / len(null_rows_df) * 100)
# drop rows with null values in employment type field
df_new = df.copy()
df_new = df_new.dropna()
print("New number of rows: ", len(df_new))
As mentioned above, the vast majority of the data is present and non-null. This is confirmed by the missingno matrix, which shows missing data (white gaps) only in the employment_type feature column.
For the rows in which employment_type is null, we drop those rows, as the categorical employment type (either Salaried or Self Employed) cannot be successfully imputed with any sort of certainty, and the affected rows only comprise 3% of the total dataset. In addition, the percentage of affected rows where the loan was defaulted on was about 21%, which is very close to the 22% of total rows where the loan was defaulted on.
Therefore, we decided to drop the rows with missing employment type data, instead of attempting to impute employment_type for those instances.
# change date of birth to more consumable age value
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
# code adapted from https://stackoverflow.com/questions/765797/python-timedelta-in-years
# to convert birthdate to years since birthdate
def convert_to_years(date):
    datetime_val = pd.to_datetime(date, format='%d-%m-%y')
    # two-digit years past the current year are parsed a century into the
    # future, so correct the resulting age by adding back 100 years
    if datetime_val > datetime.now():
        return relativedelta(datetime.now(), datetime_val).years + 100
    return relativedelta(datetime.now(), datetime_val).years
df_new["customer_age"] = df_new.date_of_birth.apply(convert_to_years)
# https://stackoverflow.com/questions/4628122/how-to-construct-a-timedelta-object-from-a-simple-string
# convert average account age and credit history length from '_yrs _mon' format
# to number of months using regex matching
import re
regex = re.compile(r'(?P<year>\d+)yrs (?P<month>\d+)mon')

def get_num_months(time_delta_str):
    # parse the custom "_yrs _mon" string and convert to a total month count
    parts = regex.match(time_delta_str)
    if not parts:
        return None
    return int(parts.group('year')) * 12 + int(parts.group('month'))
df_new["average_acct_age"] = df_new["average_acct_age"].apply(get_num_months)
df_new["credit_history_length"] = df_new["credit_history_length"].apply(get_num_months)
Some modifications were needed to make the date_of_birth, average_acct_age, and credit_history_length features usable. Date of birth was converted to customer_age (in years), while average account age and credit history length were converted from a custom string format ("_yrs _mon") to an integer representing number of months.
# delete any unnecessary columns
unnecessary_cols = ["uniqueid", "branch_id", "supplier_id", "manufacturer_id", "current_pincode_id",
"date_of_birth", "state_id", "employee_code_id", "disbursaldate"]
flags = ["mobileno_avl_flag", "aadhar_flag", "pan_flag", "voterid_flag", "driving_flag", "passport_flag"]
df_new = df_new.drop(columns=(unnecessary_cols + flags))
print("Number of new features (excluding the target variable): ", len(df_new.columns)-1)
df_new.head()
We removed columns / features whose data was not useful for loan default determination, including categorical flags that indicated whether or not the customer disclosed information such as a mobile number, voter ID, or driver's license number. Disbursal date would normally be useful, but we don't know the exact date this data was collected, so it is difficult to extract useful information from it (e.g. months_since_disbursal).
df_new["age_range"] = pd.cut(x=df_new["customer_age"],
bins=[0, 35, 50, 65, 1e6],
labels=['Young Career','Middle Career','Late Career','Retirement Age'])
df_new.age_range.describe()
df_new["credit_score_category"] = pd.cut(x=df_new["perform_cns_score"],
bins=[0, 450, 600, 750, 950],
labels=['Bad Credit','Poor Credit','Good Credit','Great Credit'])
df_new.credit_score_category.describe()
Above, we took two numeric features, customer age and credit bureau score, and extracted four categories for each feature to produce the age_range and credit_score_category categorical features. Simplifying age and credit scores down to four descriptions allows us to use crosstab operations in our data visualizations to see which subgroups have larger levels of default. Credit score breakdowns are based on this Credit Karma article.
This leaves us with two categorical features - employment_type and perform_cns_score_description, as well as 26 other numerical features. Not all 28 feature columns will be used in the visualizations and analysis below, but we are keeping them in the dataframe in case of future use.
print("Final Look at Preprocessed Dataset")
df_new.head()
primary_vars_to_use = ['pri_no_of_accts', 'pri_active_accts', 'pri_overdue_accts',
'pri_current_balance', 'pri_sanctioned_amount',
'pri_disbursed_amount', 'primary_instal_amt']
secondary_vars_to_use = ['sec_no_of_accts', 'sec_active_accts', 'sec_overdue_accts',
'sec_current_balance', 'sec_sanctioned_amount',
'sec_disbursed_amount', 'sec_instal_amt']
curr_loan_vars_to_use = ['disbursed_amount', 'asset_cost', 'ltv']
past_perf_vars_to_use = ['perform_cns_score', 'new_accts_in_last_six_months',
'delinquent_accts_in_last_six_months', 'average_acct_age', 'credit_history_length',
'no_of_inquiries']
customer_info_vars_to_use = ['employment_type', 'customer_age', 'age_range']
For easy access later, we segment the current features that we have into different groups based on what kind of information they contribute - primary loan, secondary loan, current loan, past performance, and customer demographic information. All features are numeric except the ones in customer_info_vars_to_use.
fig = plt.figure(figsize=(20,5))
plt.subplot(1,3,1)
df_new.customer_age.plot.hist(bins=20)
plt.title("Customer Age Histogram")
plt.xlabel("Customer Age (years)")
plt.subplot(1,3,2)
plt.title("Customer Age Kernel Density Estimation")
plt.xlabel("Customer Age (years)")
df_new.customer_age.plot.kde(bw_method=0.2)
plt.subplot(1,3,3)
df_new.customer_age.plot.hist(bins=20)
df_new.customer_age.plot.kde(bw_method=0.1, secondary_y=True)
plt.title("Customer Age Histogram and KDE")
plt.xlabel("Customer Age (years)")
plt.ylim([0, 0.06])
plt.show()
The histogram and kernel density estimation graphs for the customer_age numeric variable show that there are more younger (early-career and mid-career) individuals looking to get loans for vehicle purchases. This brings up the question of whether younger customers have good or bad financial performance metrics, and if they are more likely than older customers to default on loans.
fig = plt.figure(figsize=(20,5))
plt.subplot(1,3,1)
df_new.perform_cns_score.plot.hist(bins=20)
plt.title("Bureau Credit Score Histogram")
plt.xlabel("Bureau Credit Score (points)")
plt.subplot(1,3,2)
df_new.perform_cns_score.plot.kde(bw_method=0.2)
plt.title("Bureau Credit Score Kernel Density Estimation")
plt.xlabel("Bureau Credit Score (points)")
plt.subplot(1,3,3)
df_new.perform_cns_score.plot.hist(bins=20)
df_new.perform_cns_score.plot.kde(bw_method=0.1, secondary_y=True)
plt.title("Bureau Credit Score Histogram and KDE")
plt.xlabel("Bureau Credit Score (points)")
plt.ylim([0, 0.06])
plt.show()
score_risk_levels = ['A-Very Low Risk', 'B-Very Low Risk', 'C-Very Low Risk', 'D-Very Low Risk',
'E-Low Risk', 'F-Low Risk', 'G-Low Risk',
'H-Medium Risk', 'I-Medium Risk',
'J-High Risk', 'K-High Risk',
'L-Very High Risk', 'M-Very High Risk']
score_exception_levels = ['No Bureau History Available',
'Not Scored: Not Enough Info available on the customer',
'Not Scored: No Activity seen on the customer (Inactive)',
'Not Scored: Sufficient History Not Available',
'Not Scored: No Updates available in last 36 months',
'Not Scored: Only a Guarantor',
'Not Scored: More than 50 active Accounts found']
df_defined_risk_levels = df_new[df_new.perform_cns_score_description.isin(score_risk_levels)]
print("Number of instances with defined risk score category: ", len(df_defined_risk_levels))
df_undefined_risk_levels = df_new[df_new.perform_cns_score_description.isin(score_exception_levels)]
print("Number of instances with undefined risk score category: ", len(df_undefined_risk_levels))
print("Undefined risk level credit bureau scores: ", sorted(df_undefined_risk_levels.perform_cns_score.unique()))
The histogram and kernel density estimation graphs for the perform_cns_score numeric variable show that more than half the customers in the dataset have a credit score of 0, which most likely corresponds to the perform_cns_score_description values of "No Bureau History" or "Not Scored: {reason}." Score descriptions deemed defined and undefined (score_risk_levels and score_exception_levels, respectively) were listed above from the total set of unique descriptions. As seen above, undefined-risk-level credit bureau scores were all less than 20.
This means that in certain future data visualization and analysis concerning the credit scores, we may need to remove all instances where customers have undefined risk category descriptions (listed above in score_exception_levels), whether this is due to very poor financial performance or customers who are just starting out building credit.
import seaborn as sns
df_add_ltv = df_new[['employment_type','customer_age','age_range','ltv']]
sns.barplot(x="employment_type", y="ltv", hue="age_range", data=df_add_ltv)
plt.title("Measuring LTV ratio against Customer Employment and Age Range")
plt.show()
We want to analyze customer age and employment type in relation to ltv, which stands for the loan-to-value ratio and incorporates information about the current loan in question. Based on the barplot, we can see that career experience may be a determinant when it comes to ltv and how much a bank is willing to grant to an individual seeking a loan. The middle career and late career groups seem to have higher ltv compared to young career and retirement-aged customers, indicating financial stability is something that banks look at in the credit underwriting process.
Next, we want to visualize customer age and employment type in relation to our target variable, the binary variable loan_default.
df_sample = df_new.sample(frac=0.10)
print("Size of subsample: ", len(df_sample))
We have a large number of instances, so to speed up the computation of our box, violin, and swarm plots, we decided to randomly sample and use only 10% of the data for the following visualizations. 10% of our dataset is around 22,500 instances, which is still a nontrivial amount of data.
plt.subplots(figsize=(20, 8))
plt.subplot(1,3,1)
sns.boxplot(x="employment_type", y="customer_age", hue="loan_default", data=df_sample)
plt.title('Customer Age vs. Employment Type Boxplot')
plt.subplot(1,3,2)
sns.violinplot(x="employment_type", y="customer_age", hue="loan_default", data=df_sample)
plt.title('Customer Age vs. Employment Type Violinplot')
plt.subplot(1,3,3)
sns.swarmplot(x="employment_type", y="customer_age", hue="loan_default", data=df_sample)
plt.title('Customer Age vs. Employment Type Swarmplot')
plt.show()
Here, we plot a boxplot, violinplot, and swarmplot of customer_age grouped by employment_type. The boxplot lets us examine the overall distribution of the customer age variable. We added the violin and swarm plots to check for unusual modality or clusters in the data, because boxplots alone are not reliable - many different distributions can share the same boxplot, so we may be missing underlying features of the distribution.
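The claim that different distributions can share a boxplot is easy to demonstrate. The toy arrays below (made up for illustration) have identical five-number summaries - same box, same whiskers - yet one is spread out while the other clusters at the quartiles:

```python
import numpy as np

a = np.array([0, 25, 50, 75, 100])            # evenly spread sample
b = np.array([0, 25, 25, 50, 75, 75, 100])    # clustered at the quartiles
# min, Q1, median, Q3, and max all agree, so the boxplots are identical
for q in (0, 25, 50, 75, 100):
    assert np.percentile(a, q) == np.percentile(b, q)
```

A violin or swarm plot over the same data would immediately reveal the clustering that the boxplot hides.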
By comparing these two variables and looking at the instances of loan default in this context:
vars_to_use = ['loan_default', 'customer_age', 'employment_type', 'perform_cns_score',
'disbursed_amount', 'asset_cost', 'ltv',
'pri_no_of_accts', 'pri_active_accts', 'pri_overdue_accts', 'pri_current_balance',
'new_accts_in_last_six_months', 'delinquent_accts_in_last_six_months',
'no_of_inquiries']
df_important_cols = df_new[vars_to_use]
import seaborn as sns
sns.set(style="darkgrid")
f, ax = plt.subplots(figsize=(15, 10))
# sns.set returns None, so specify a colormap explicitly
sns.heatmap(df_important_cols.corr(), cmap="coolwarm", annot=True)
plt.title("Selected Feature Correlation Matrix")
plt.show()
Here, we've visualized a correlation matrix between a subset of the dataset features that we deemed most important to determination of potential loan default (based on description of the features in the dataset's included metadata). There are some highly positive correlations that are expected - for example, the disbursed amount for the loan is going to be greater as the vehicle asset cost increases, since customers will likely request a greater loan, and the number of primary active accounts is directly related to the number of accounts opened in the last six months.
Of particular interest here are the following observations:
The loan_default target variable is binary and therefore not very conducive to correlation-matrix comparison.

age_to_default_percentage = pd.crosstab([df_new['age_range']], df_new.loan_default.astype(bool), normalize='index')
print(age_to_default_percentage)
age_to_default_percentage.plot(kind='barh', stacked=True)
plt.title("Percentage of Loan Defaults based on Age Range Subgroups")
plt.xlabel("Percentage of No Loan Default vs. Loan Default in Subgroup")
plt.show()
This crosstab operation, which groups customer instances by age_range, shows that as customer age increases, loan default becomes less likely. Young career individuals have the greatest percentage of defaults, whereas retirement-age individuals have the lowest. This is in line with the general assumption that older customers have more financial stability due to wealth and financial knowledge accumulated over their careers.
emp_default = pd.crosstab([df_defined_risk_levels['employment_type']],
df_defined_risk_levels.loan_default.astype(bool),
normalize='index')
print(emp_default)
emp_default.plot(kind='barh', stacked=True)
plt.title("Percentage of Loan Defaults based on Employment Type Subgroups")
plt.xlabel("Percentage of No Loan Default vs. Loan Default in Subgroup")
plt.show()
This crosstab operation, which groups customer instances by employment_type, shows that salaried individuals are slightly less likely to default on loans than self-employed individuals. This is in line with the general assumption that salaried customers tend to have more stable income than customers who are self-employed.
score_default = pd.crosstab([df_new['perform_cns_score_description']],
df_new.loan_default.astype(bool),
normalize='index')
score_default.plot(kind='barh', stacked=True, figsize=(8,6))
plt.title("Percentage of Loan Defaults based on Credit Score Category Subgroups")
plt.xlabel("Percentage of No Loan Default vs. Loan Default in Subgroup")
plt.show()
The graph shown above visualizes the percentage of customers in each original bureau score subgroup that do and do not default on their vehicle loans. As one might expect, loan default is generally correlated with risk category level: the highest percentage of defaults came from customers classified as M-Very High Risk, the lowest from customers classified as B-Very Low Risk, and there is a gradual increase in default percentage as the risk category rises from A to M (Very Low Risk to Very High Risk). An interesting observation is that customers classified as I-Medium Risk had default percentages similar to K-High Risk and L-Very High Risk, at around 27%; defaulting customers at this bureau score level may eventually be shifted to a higher risk category once the loan default is reflected in their future scores.
score_default = pd.crosstab([df_new['age_range'], df_new['credit_score_category']],
df_new.loan_default.astype(bool),
normalize='index')
score_default.plot(kind='barh', stacked=True, figsize=(8,6))
plt.title("Percentage of Loan Defaults based on Credit Score and Age Range Subgroups")
plt.xlabel("Percentage of No Loan Default vs. Loan Default in Subgroup")
plt.show()
To finalize our analysis for this question, we created a plot that creates subgroups for each categorical credit score category and age range categorical pair and calculated the percentage of loan default in each subgroup. Not surprisingly, the subgroups with bad or poor credit had more loan default instances than subgroups with good or great credit. We've seen from previous analyses that age range does play a role in loan default, but this chart shows that past financial performance (in the form of a credit score) has an even greater role.
df_primary_jitter = df_new[primary_vars_to_use].copy()
df_primary_jitter['loan_default'] = df_new.loan_default
sns.pairplot(df_primary_jitter, hue="loan_default", vars=primary_vars_to_use,
height=2, plot_kws=dict(s=20, alpha=0.15, linewidth=0))
plt.show()
df_secondary_jitter = df_new[secondary_vars_to_use].copy()
df_secondary_jitter['loan_default'] = df_new.loan_default
sns.pairplot(df_secondary_jitter, hue="loan_default", vars=secondary_vars_to_use,
height=2, plot_kws=dict(s=20, alpha=0.15, linewidth=0))
plt.show()
Here, we visualize two scatter plots, one with past primary loan information and one with past secondary loan information. We wanted to see if there were any groups of primary or secondary loan features that would be useful for inclusion in principal component analysis and future classification, since these features were more than half of the dataset and we wanted to reduce the dataset's dimensionality.
What we can see is that there are certain expected linear relationships (e.g. between current_balance, sanctioned_amount, and disbursed_amount in both the primary and secondary feature sets), but no clear separation in the plots between instances of loan default and no default. This suggests applying PCA or a clustering algorithm to more easily find "groups" of loan customers, whose feature characteristics banks could then use in approval decisions.
The most interesting observation from these plots is that "no default" customers tend to have larger values for most of the features. Interestingly, there are customers with very large current balances and many overdue accounts who still manage not to default on their vehicle loan. This could indicate that they are prioritizing this loan (since it is a large purchase), or simply that they had not yet defaulted at the time this data was gathered. Normally, a greater number of loans would mean increased credit risk, as these customers are more leveraged, but it may also mean that these customers are more cognizant of their financial situation and are working hard to avoid default.
Sub-question: There is a lot of primary and secondary loan information which inflates the number of features. Are all of these necessary? Why or why not?
import plotly.express as px
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
pca_chosen_features = list(df_new.select_dtypes(include=['number']))
pca_chosen_features.remove('loan_default')
df_sklearn = df_new[pca_chosen_features]
df_y = df_new['loan_default']
# plot explained variance
pca = PCA()
pca.fit(df_sklearn)
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)
px.area(
x=range(1, exp_var_cumul.shape[0] + 1),
y=exp_var_cumul,
labels={"x": "# Components", "y": "Explained Variance"}
)
# pca code adapted from https://plotly.com/python/pca-visualization/
# visualizing a subset of the principal components that explains almost 99% of the total variance
n_components = 2
pca = PCA(n_components=n_components)
components = pca.fit_transform(df_sklearn)
total_var = pca.explained_variance_ratio_.sum() * 100
labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'loan_default'
fig = px.scatter_matrix(
components,
color=df_y,
dimensions=range(n_components),
labels=labels,
title=f'Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
As a segue to the next section, where we use UMAP for non-linear dimensionality reduction, we utilize the PCA implementation from sklearn and plotting functions from plotly to answer our question of what weighted combination of numeric features best describes our data.
We first take all of the numeric features from the original dataset (24 features, a majority) and run PCA with no parameters to find how many components we should specify. The explained variance ratio graph shows that almost 99% of the variation in the data can be represented by only two principal components, so we set n_components to 2.
Our results are less than satisfactory - there are no clear clusters separating individuals who default on vehicle loans from those who do not. This could be because the structure in the data is non-linear, which standard PCA cannot capture. In that case, kernel PCA may be another approach to finding principal components [https://www.cs.cmu.edu/~elaw/papers/pca.pdf].
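Another possible explanation - our assumption, not something established by the dataset - for two components capturing ~99% of the variance is that the numeric features were never standardized, so the largest-magnitude columns (e.g. rupee amounts) dominate the components. A minimal sketch of the effect on toy data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two independent features on wildly different scales (toy data, not the loans).
X = np.column_stack([rng.normal(0, 1, 500),        # e.g. a ratio-like feature
                     rng.normal(0, 10_000, 500)])  # e.g. a monetary amount
pca_raw = PCA().fit(X)
pca_std = PCA().fit(StandardScaler().fit_transform(X))
print(pca_raw.explained_variance_ratio_[0])  # ~1.0: the big column dominates
print(pca_std.explained_variance_ratio_[0])  # ~0.5: comparable contributions
```

Re-running our PCA on standardized features would tell us whether the 99%-in-two-components result reflects real structure or merely scale differences.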
# code adapted from https://umap-learn.readthedocs.io/en/latest/basic_usage.html
import plotly.express as px
from sklearn.decomposition import PCA
import umap.umap_ as umap
# separate data and target
df_umap = df_new.copy()
df_umap = df_umap[pca_chosen_features]
df_umap["loan_default"] = df_new.loan_default
df_umap = df_umap.sample(frac=0.10)
X = df_umap.loc[:, df_umap.columns != "loan_default"]
y = df_umap["loan_default"]
# declare reducer and call fit / transform to get embeddings
reducer = umap.UMAP(random_state=42, n_neighbors=10, min_dist=0.1)
reducer.fit(X)
embedding = reducer.transform(X)
print(embedding.shape)
# plot umap results
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(3)-0.5).set_ticks(np.arange(2))
plt.title('UMAP projection of Vehicle Loan Dataset', fontsize=12);
UMAP stands for Uniform Manifold Approximation and Projection. It is a dimension reduction technique that can be used for visualization but also for general non-linear dimension reduction. The UMAP algorithm can be described as two phases: first, construction of a weighted k-nearest-neighbor graph that approximates the manifold structure of the data; second, optimization of a low-dimensional layout of that graph to be as structurally similar to it as possible.
The result is a fast dimension reduction algorithm that can handle large, high-dimensional datasets without too much difficulty, scaling beyond what most t-SNE (t-Distributed Stochastic Neighbor Embedding) implementations can manage.
We use the UMAP library to construct and train a model on all of the numeric features in the dataset, then plot the results of UMAP embeddings. Due to limited computational power, we only utilize a subset of instances that is randomly sampled from the entire dataset. Similar to our PCA analysis, there are not discrete clusters that represent either of the target classes. The radial nature of the resultant UMAP plotting is possibly an indication that the dataset's features are non-linear.
From our two dimensionality reduction experiments, we can tentatively conclude that using PCA on this dataset is not ideal. There is more work to be done, and other techniques that could be applied, to achieve results similar to those we expected from the PCA algorithms.
McInnes, L., Healy, J., Saul, N., & Grossberger, L. (2018). UMAP: Uniform Manifold Approximation and Projection. The Journal of Open Source Software, 3(29), 861.